Research Project Report: Spark, BlinkDB and Sampling
Author
Abstract
During the 2015-2016 academic year, I conducted research on Spark, BlinkDB, and various sampling techniques. This research gave the team a better understanding of the capabilities and properties of Spark, the BlinkDB system, and different sampling techniques. Additionally, I implemented and benchmarked various machine learning and sampling methods on top of both Spark and BlinkDB. My work has two parts: Part I (Fall 2015), Research on Spark, and Part II (Spring 2016), Probability and Sampling Techniques and Systems.
Similar Resources
Blink and It's Done: Interactive Queries on Very Large Data
In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggrega...
Review of Diesel Particulate Matter Sampling Methods Final Report
Contents: Introduction; Diesel Engine Technology and Emission Regulations; Physical and Chemical Nature of Diesel Aerosol ...
Tokeneer: Beyond Formal Program Verification
Tokeneer is a small-sized (10 kloc) security system which was formally developed and verified by Praxis at the request of NSA, using SPARK technology. Since its open-source release in 2008, only two problems were found, one by static analysis, one by code review. In this paper, we report on experiments where we systematically applied various static analysis tools (compiler, bug-finder, proof to...
Approximate Stream Analytics in Apache Flink and Apache Spark Streaming
Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing — based on the chosen sample size — can make a systematic trade-off between the output accuracy and computation effi...
Parallel Maritime Traffic Clustering Based on Apache Spark
Extracting maritime traffic patterns is an essential part of maritime security and surveillance, and DBSCANSD is a density-based clustering algorithm that extracts the arbitrary shapes of normal lanes from AIS data. This paper presents a parallel DBSCANSD algorithm on top of Apache Spark. The project is experimental research, and the results shown in this paper are preliminary. The exper...